feat: evaluation ingestion (no user-facing feature is added) #1764

Merged: 19 commits into main from evaluation-ingestion, Nov 21, 2023

@RogerHYang RogerHYang commented Nov 16, 2023

Purpose

  1. Provides a Jupyter notebook and experimental helper functions for
    • extracting spans from Session
    • running evals on those spans
    • ingesting eval results back to Session
  2. Adds internal data structures for eval ingestions and gql queries

Changes

  1. Defines proto for Evaluation
  2. Adds Evaluation to HttpExporter
  3. Adds fixture parquet files for evaluations
  4. Adds http endpoint and receiver for Evaluation
  5. Adds px.core.evals.Evals (analogous to px.core.traces.Traces) to store the received Evaluation
  6. Attaches px.core.evals.Evals to px.session.session.Session
  7. Adds evaluations to the spans query in GraphQL
  8. Adds GraphQL query to retrieve all available span evaluation names
  9. Adds Jupyter Notebook for ingesting evaluations after running llm_classify
  10. Adds helper functions for the Jupyter Notebook, e.g. to extract spans from the Phoenix session

Caveats

  1. Duplicate evaluations, when ingested a second time, overwrite the existing ones
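
The overwrite behavior in this caveat can be pictured as a store keyed by evaluation name and span ID. The following is a minimal sketch only: `EvalStore` and the `Evaluation` tuple are hypothetical stand-ins for illustration, not the actual `px.core.evals.Evals` implementation.

```python
from typing import Dict, List, NamedTuple, Optional, Tuple


class Evaluation(NamedTuple):
    """Hypothetical, simplified evaluation record (not the protobuf message)."""

    name: str
    span_id: str
    score: Optional[float] = None
    label: Optional[str] = None
    explanation: Optional[str] = None


class EvalStore:
    """Hypothetical store: one evaluation per (name, span_id) pair."""

    def __init__(self) -> None:
        self._by_key: Dict[Tuple[str, str], Evaluation] = {}

    def put(self, evaluation: Evaluation) -> None:
        # Assigning to an existing key replaces the earlier evaluation,
        # which is the overwrite behavior described in the caveat.
        self._by_key[(evaluation.name, evaluation.span_id)] = evaluation

    def get_evaluations_by_span_id(self, span_id: str) -> List[Evaluation]:
        return [e for (_, sid), e in self._by_key.items() if sid == span_id]
```

Ingesting the same name and span ID twice leaves a single entry holding the most recent values.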

GraphQL Sample Output

Use the trace fixture llama_index_rag.

Span Evaluation Names

[Screenshot 2023-11-16 at 2:15:29 PM]

GraphQL Query

query Query {
  spanEvaluationNames
}

Span Evaluations and Document Evaluations

[Screenshot 2023-11-16 at 2:17:06 PM]

GraphQL Query

query Query {
  spans(filterCondition:"name == 'query' or span_kind == 'RETRIEVER'") {
    edges {
      node {
        name
        context {
          spanId
        }
        input {
          value
        }
        spanEvaluations {
          name
          score
          label
          explanation
        }
        documentEvaluations {
          name
          documentPosition
          score
          label
          explanation
        }
      }
    }
  }
}

@Arize-ai Arize-ai deleted a comment from review-notebook-app bot Nov 16, 2023
@RogerHYang RogerHYang marked this pull request as ready for review November 16, 2023 22:09
@RogerHYang RogerHYang changed the title feat: evaluation ingestion feat: evaluation ingestion (no user-facing feature is added) Nov 16, 2023

@mikeldking mikeldking left a comment

Niiice! Left some comments but you walked me through it so will give others a chance to take a look before stamping. Ping me tomorrow.

label: String
explanation: String
spanId: String!
documentPosition: Int!
Contributor

docs: Adding descriptions for the fields would be useful

Contributor Author

will do

score: Float
label: String
explanation: String
spanId: String!
Contributor

Did you mean to expose the spanId here? If so this becomes a bit less generic. Totally a nit.

Contributor Author

Good catch! I was just copy/pasting what I had in mind for the Python code.

score: Float
label: String
explanation: String
spanId: String!
Contributor

Same as above - maybe this was for troubleshooting but not needed I think?

Contributor Author

yup. will remove. thx


@strawberry.type
class SpanEvaluation(Evaluation):
    span_id: str
Contributor

I think you can mark this as private

Contributor Author

Good idea!

@@ -122,6 +124,14 @@ class Span:
        description="Cumulative (completion) token count from self and all "
        "descendant spans (children, grandchildren, etc.)",
    )
    span_evaluations: List[SpanEvaluation] = strawberry.field(
        description="Span evaluations",
Contributor

Nit, no need to repeat the name - best to be more verbose and informative in the descriptions.

Contributor Author

yea, that makes sense

Comment on lines 157 to 163
span_evaluations: List[SpanEvaluation] = []
document_evaluations: List[DocumentEvaluation] = []
span_id = span.context.span_id
for evaluation in evals.get_evaluations_by_span_id(span_id) if evals else ():
    span_evaluations.append(SpanEvaluation.from_pb_evaluation(evaluation))
for evaluation in evals.get_document_evaluations_by_span_id(span_id) if evals else ():
    document_evaluations.append(DocumentEvaluation.from_pb_evaluation(evaluation))
Contributor

optimization: You can place this code of getting evaluations and documents on the Span node. This will alleviate the load of the query if the query doesn't ask for anything related to evals. If there are multiple fields that require evaluations by span_id, you can wrap that in a dataloader to eliminate the n+1. Can do this as a follow-up but it's a worthwhile refactor as it's always good to eliminate over-fetching if possible.

Contributor

Some links to dataloading: https://leebyron.com/dataloader-v2/

Contributor Author

oh yea, that's right. we had talked about this before. I totally forgot about it
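
The dataloader idea raised in this thread can be sketched without any GraphQL library: keys requested during the same event-loop tick are collected and resolved with one batched lookup instead of one lookup per span. This is an illustrative sketch under stated assumptions; `BatchLoader`, `fetch_evaluations_for_spans`, and the in-memory `EVALS_BY_SPAN_ID` table are hypothetical and are not the API of Strawberry or of this PR.

```python
import asyncio
from typing import Callable, Dict, List, Sequence, Tuple

Evaluation = Tuple[str, float]  # hypothetical (name, score) pair

# Hypothetical in-memory data standing in for the evaluation store.
EVALS_BY_SPAN_ID: Dict[str, List[Evaluation]] = {
    "span-1": [("relevance", 1.0)],
    "span-2": [("relevance", 0.0), ("toxicity", 0.1)],
}


def fetch_evaluations_for_spans(span_ids: Sequence[str]) -> List[List[Evaluation]]:
    """Batched lookup: one call returns results for all requested span IDs."""
    return [EVALS_BY_SPAN_ID.get(span_id, []) for span_id in span_ids]


class BatchLoader:
    """Collects keys requested in one event-loop tick and loads them together."""

    def __init__(self, batch_fn: Callable[[Sequence[str]], List[List[Evaluation]]]) -> None:
        self._batch_fn = batch_fn
        self._pending: Dict[str, asyncio.Future] = {}
        self.batch_calls = 0  # counts how many batched lookups were issued

    def load(self, key: str) -> asyncio.Future:
        loop = asyncio.get_running_loop()
        if not self._pending:
            # First key of this tick: schedule a single dispatch for the batch.
            loop.call_soon(self._dispatch)
        future = self._pending.get(key)
        if future is None:
            future = loop.create_future()
            self._pending[key] = future
        return future

    def _dispatch(self) -> None:
        pending, self._pending = self._pending, {}
        self.batch_calls += 1
        keys = list(pending)
        for key, value in zip(keys, self._batch_fn(keys)):
            pending[key].set_result(value)


async def main() -> Tuple[List[List[Evaluation]], int]:
    loader = BatchLoader(fetch_evaluations_for_spans)
    span_ids = ("span-1", "span-2", "span-3")
    results = await asyncio.gather(*(loader.load(s) for s in span_ids))
    return results, loader.batch_calls


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Three span lookups here result in one call to the batch function, which is the n+1 elimination the comment describes.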

Comment on lines +164 to +168
Thread(
    target=_load_items,
    args=(evals, fixture_evals, simulate_streaming),
    daemon=True,
).start()
Contributor

future thought: I might need to have a "dirty" bit to know when evals change so I know how to refetch due to the lack of subscriptions :(

Contributor Author

not sure how exactly. let's catch up later
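
One possible shape for the "dirty" bit, offered purely as a sketch since nothing in this PR implements it: the store bumps a version counter on every write, and a client refetches only when the version it last saw is stale. `VersionedStore` and `needs_refetch` are hypothetical names.

```python
import threading
from typing import Any, Dict


class VersionedStore:
    """Hypothetical store that increments a version counter on every write."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._items: Dict[str, Any] = {}
        self._version = 0

    def put(self, key: str, value: Any) -> None:
        # The lock keeps the write and the version bump consistent when
        # items are loaded from a background thread.
        with self._lock:
            self._items[key] = value
            self._version += 1

    @property
    def version(self) -> int:
        with self._lock:
            return self._version


def needs_refetch(store: VersionedStore, last_seen_version: int) -> bool:
    """Return True when the store changed since the client last fetched."""
    return store.version != last_seen_version
```

A client would remember `store.version` after each fetch and poll `needs_refetch` with that value.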

Comment on lines 59 to 67
for index, row in evaluations.iterrows():
    subject_id = _extract_subject_id(cast(Union[str, Tuple[str]], index), index_names)
    result = _extract_result(row)
    evaluation = pb.Evaluation(
        name=evaluation_name,
        result=result,
        subject_id=subject_id,
    )
    exporter.export(evaluation)
Contributor

optimization: this is pretty inefficient overall. I get that we are trying to leverage existing code, but I think uploading this in bulk / chunks feels much more practical?

Contributor Author

yup, totally agree. I was just trying to limit the scope of this PR. Will definitely upgrade in a future PR
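
For the follow-up mentioned here, uploading in chunks could start from a small batching helper like the one below. This is a sketch only: `chunked` is a hypothetical helper, and how each chunk would be sent (for example, one request per chunk instead of one per row) is left open because the PR does not define a bulk endpoint.

```python
from itertools import islice
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def chunked(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from `items`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    # islice pulls up to `size` items at a time; an empty list ends the loop.
    while chunk := list(islice(iterator, size)):
        yield chunk
```

With this helper, the per-row loop would build evaluations first and then export one chunk at a time.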

Comment on lines 73 to 74
if index_names and index_names[0].endswith("span_id"):
    if len(index_names) == 2 and index_names[1].endswith("document_position"):
Contributor

there's a bit of magic here that would not be intuitive to the reader. Can you figure out how to maybe leverage variable names and a bit of docstrings to make this easier to grok? Without being intimately familiar with the structure of the data, I think this will go over a reader's head.

Contributor Author

agree. will add docstring
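
As an illustration of what the reviewer is asking for, the index-name check could be wrapped in a named function with a docstring. This is a sketch based only on the two conditions quoted above; `evaluation_kind` and its return values are hypothetical, and index names are assumed to be strings.

```python
from typing import Sequence


def evaluation_kind(index_names: Sequence[str]) -> str:
    """Classify an evaluations DataFrame by the names of its index levels.

    The DataFrame index is assumed to identify the subject of each evaluation:

    - a single level ending in "span_id" means one evaluation per span;
    - two levels, ending in "span_id" and "document_position", mean one
      evaluation per retrieved document within a span.
    """
    if not index_names or not index_names[0].endswith("span_id"):
        raise ValueError(f"first index level must end with 'span_id': {list(index_names)}")
    if len(index_names) == 1:
        return "span"
    if len(index_names) == 2 and index_names[1].endswith("document_position"):
        return "document"
    raise ValueError(f"unrecognized index levels: {list(index_names)}")
```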

@mikeldking
Contributor

Minor optimization on the PR comment - Might help if you put graphql codeblocks so others can copy / paste and try out the queries for themselves.


@RogerHYang RogerHYang left a comment

added sample graphql queries to the PR description


@mikeldking mikeldking left a comment

@@ -0,0 +1,102 @@
from typing import Any, Iterable, List, Mapping, Optional, Tuple, Union, cast
Contributor

Typo in the name of this file.

Contributor Author

very good eye! thanks for the catch!

Comment on lines 36 to 37
except Exception:
    return Response(status_code=HTTP_422_UNPROCESSABLE_ENTITY)
Contributor

Should we return a 500 here instead of a 422?

Contributor Author

good catch! it's intended for line 31, so this is actually not the right place to put it

Contributor Author

i also confirmed that 500 is automatic via starlette: we don't need to catch anything for it
[Screenshot 2023-11-20 at 1:13:26 PM]
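
The distinction settled in this thread, 422 for a body that cannot be parsed and 500 for anything that fails afterwards, can be sketched in plain Python. The names are hypothetical, `json.loads` stands in for the protobuf parsing step, and the 200 success code is an assumption; in the real handler the 500 comes from Starlette when an exception propagates.

```python
import json
from typing import Any, Callable, Dict

HTTP_200_OK = 200  # assumed success code, for illustration only
HTTP_422_UNPROCESSABLE_ENTITY = 422


def handle_evaluation_post(body: bytes, store: Callable[[Dict[str, Any]], None]) -> int:
    """Return an HTTP status code for an incoming evaluation payload."""
    try:
        # Only the parsing step is guarded: a malformed body is the
        # client's fault, so it maps to 422.
        payload = json.loads(body)
    except ValueError:
        return HTTP_422_UNPROCESSABLE_ENTITY
    # Failures past this point are not caught here; they propagate so the
    # server framework can report them as a 500.
    store(payload)
    return HTTP_200_OK
```

Keeping the `try` block narrow is what prevents unrelated server errors from being reported as 422.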

@axiomofjoy
Contributor

looking really good @RogerHYang

@RogerHYang RogerHYang merged commit 7c4039b into main Nov 21, 2023
10 checks passed
@RogerHYang RogerHYang deleted the evaluation-ingestion branch November 21, 2023 00:29